A Novel Architecture of Parking Management for Smart Cities
Abstract: Parking is becoming an expensive resource in almost every major city in the world. Current technically advanced solutions for parking management rely on secured wireless networks and sensor communication for parking reservation. Moreover, new rules concerning financial transactions in mobile payment allow the definition of new intelligent frameworks that enable convenient management of public parking in urban areas. The paper discusses the conceptual architecture of IPA (Intelligent Parking Assistant), which aims to improve on current parking management solutions and thereby become a leading paradigm for the so-called "smart cities".
Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints
We propose a first step toward multilingual end-to-end automatic speech
recognition (ASR) by integrating knowledge about speech articulators. The key
idea is to leverage a rich set of fundamental units that can be defined
"universally" across all spoken languages, referred to as speech attributes,
namely manner and place of articulation. Specifically, several deterministic
attribute-to-phoneme mapping matrices are constructed based on the predefined
universal attribute inventory; these matrices project the knowledge-rich
articulatory attribute logits into output phoneme logits. The mapping puts
knowledge-based constraints to limit inconsistency with acoustic-phonetic
evidence in the integrated prediction. Combined with phoneme recognition, our
phone recognizer is able to infer from both attribute and phoneme information.
The proposed joint multilingual model is evaluated through phoneme recognition.
In multilingual experiments over 6 languages on benchmark datasets LibriSpeech
and CommonVoice, we find that our proposed solution outperforms conventional
multilingual approaches with a relative improvement of 6.85% on average, and it
also performs much better than a monolingual model.
Further analysis conclusively demonstrates that the proposed solution
eliminates phoneme predictions that are inconsistent with attributes.
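The projection from attribute logits to phoneme logits described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the toy attribute inventory, the five-phoneme set, the single combined binary matrix (the abstract mentions several matrices), and the helper names `build_mapping` and `project`. It is not the authors' implementation.

```python
# Hypothetical toy inventory; the paper's real attribute and phoneme sets are not given here.
ATTRIBUTES = ["stop", "fricative", "nasal", "bilabial", "alveolar"]
PHONEMES = {
    "p": {"stop", "bilabial"},
    "t": {"stop", "alveolar"},
    "s": {"fricative", "alveolar"},
    "m": {"nasal", "bilabial"},
    "n": {"nasal", "alveolar"},
}


def build_mapping(attributes, phonemes):
    """Deterministic binary matrix: entry [a][p] is 1.0 if phoneme p has attribute a."""
    return [
        [1.0 if attr in feats else 0.0 for feats in phonemes.values()]
        for attr in attributes
    ]


def project(attr_logits, mapping):
    """Phoneme logits as the matrix product attr_logits @ mapping."""
    n_phonemes = len(mapping[0])
    return [
        sum(attr_logits[i] * mapping[i][j] for i in range(len(attr_logits)))
        for j in range(n_phonemes)
    ]


mapping = build_mapping(ATTRIBUTES, PHONEMES)
# Illustrative attribute logits: strong evidence for "stop" and "bilabial".
attr_logits = [2.0, -1.0, -3.0, 1.5, -0.5]
phoneme_logits = project(attr_logits, mapping)
best = max(zip(PHONEMES, phoneme_logits), key=lambda pair: pair[1])[0]
print(best, phoneme_logits)
```

Because the matrix is fixed rather than learned, a phoneme can only score highly when its own attributes do, which is the kind of consistency constraint the abstract refers to.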
S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
We address the video prediction task by putting forth a novel model that
combines (i) our recently proposed hierarchical residual vector quantized
variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN
(ST-PixelCNN). We refer to this approach as a sequential hierarchical residual
learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging
the intrinsic capabilities of HR-VQVAE at modeling still images with a
parsimonious representation, combined with the ST-PixelCNN's ability at
handling spatiotemporal information, S-HR-VQVAE can better deal with chief
challenges in video prediction. These include learning spatiotemporal
information, handling high dimensional data, combating blurry prediction, and
implicit modeling of physical characteristics. Extensive experimental results
on the KTH Human Action and Moving-MNIST tasks demonstrate that our model
compares favorably against top video prediction techniques both in quantitative
and qualitative evaluations despite a much smaller model size. Finally, we
boost S-HR-VQVAE by proposing a novel training method to jointly estimate the
HR-VQVAE and ST-PixelCNN parameters.
Comment: 14 pages, 7 figures, 3 tables. Submitted to IEEE Transactions on
Pattern Analysis and Machine Intelligence on 2023-07-1
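As a rough illustration of the residual quantization idea behind HR-VQVAE, the sketch below quantizes a vector layer by layer, with each layer encoding what the previous layers left unexplained. The codebooks, the input vector, and the function names are hypothetical, and the sketch leaves out codebook learning, any hierarchical linking between layers, and the ST-PixelCNN prior.

```python
def nearest(vec, codebook):
    """Return the codebook entry with the smallest squared Euclidean distance to vec."""
    return min(codebook, key=lambda code: sum((v - c) ** 2 for v, c in zip(vec, code)))


def residual_quantize(vec, codebooks):
    """Quantize vec with one codebook per layer, each layer coding the remaining residual."""
    residual = list(vec)
    reconstruction = [0.0] * len(vec)
    chosen = []
    for codebook in codebooks:
        code = nearest(residual, codebook)
        chosen.append(code)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return chosen, reconstruction


# Hypothetical fixed codebooks: a coarse layer followed by a finer one.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
    [(-0.25, 0.0), (0.0, 0.25), (0.25, -0.25)],
]
chosen, reconstruction = residual_quantize((1.25, 0.75), codebooks)
print(chosen, reconstruction)
```

In this toy case the coarse layer picks (1.0, 1.0) and the finer layer corrects the leftover (0.25, -0.25), so two small codebooks together represent the input exactly.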
Una estrategia de procesamiento automático del habla basada en la detección de atributos
State-of-the-art automatic speech and speaker recognition systems are often built with a pattern matching framework that has proven to achieve low recognition error rates for a variety of resource-rich tasks when the volume of speech and text examples to build statistical acoustic and language models is plentiful, and the speaker, acoustics and language conditions follow a rigid protocol. However, because of the "blackbox" top-down knowledge integration approach, such systems cannot easily leverage a rich set of knowledge sources already available in the literature on speech, acoustics and languages. In this paper, we present a bottom-up approach to knowledge integration, called automatic speech attribute transcription (ASAT), which is intended to be "knowledge-rich", so that new and existing knowledge sources can be verified and integrated into current spoken language systems to improve recognition accuracy and system robustness. Since the ASAT framework offers a "divide-and-conquer" strategy and a "plug-and-play" game plan, it will facilitate a cooperative speech processing community that every researcher can contribute to, with a view to improving speech processing capabilities which are currently not easily accessible to researchers in the speech science community.
Embedded Knowledge-based Speech Detectors for Real-Time Recognition Tasks
Speech recognition has become common in many application domains, from dictation systems for professional practices to vocal user interfaces for people with disabilities or hands-free system control. However, so far the performance of automatic speech recognition (ASR) systems is comparable to human speech recognition (HSR) only under very strict working conditions, and in general it is much lower. Incorporating acoustic-phonetic knowledge into ASR design has been proven a viable approach to raise ASR accuracy. Manner of articulation attributes such as vowel, stop, fricative, approximant, nasal, and silence are examples of such knowledge. Neural networks have already been used successfully as detectors for manner of articulation attributes starting from representations of speech signal frames. In this paper, the full system implementation is described. The system has a first stage for MFCC extraction followed by a second stage implementing a sinusoidal-based multi-layer perceptron for speech event classification. Implementation details over a Celoxica RC203 board are given.
A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models
We propose a multi-dimensional structured state space (S4) approach to speech
enhancement. To better capture the spectral dependencies across the frequency
axis, we focus on modifying the multi-dimensional S4 layer with whitening
transformation to build new small-footprint models that also achieve good
performance. We explore several S4-based deep architectures in time (T) and
time-frequency (TF) domains. The 2-D S4 layer can be considered a particular
convolutional layer with an infinite receptive field although it utilizes fewer
parameters than a conventional convolutional layer. Evaluated on the
VoiceBank-DEMAND data set, when compared with the conventional U-net model
based on convolutional layers, the proposed TF-domain S4-based model is 78.6%
smaller in size, yet it still achieves competitive results with a PESQ score of
3.15 with data augmentation. By increasing the model size, we can even reach a
PESQ score of 3.18.
Comment: Accepted to Interspeech 2023. Code will be released at
https://github.com/Kuray107/S4ND-U-Net_speech_enhancemen
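The remark that an S4 layer can be viewed as a convolution with an unbounded receptive field can be checked on a toy case. The sketch below assumes a one-dimensional linear state space model with a scalar state, x_k = a*x_(k-1) + b*u_k and y_k = c*x_k, with illustrative parameter values. It is not the multi-dimensional S4 layer from the paper, only a demonstration that the recurrence and a causal convolution with kernel K_j = c * a^j * b give the same output.

```python
def ssm_recurrent(u, a, b, c):
    """Run the scalar state space model step by step."""
    x, outputs = 0.0, []
    for u_k in u:
        x = a * x + b * u_k
        outputs.append(c * x)
    return outputs


def ssm_kernel(length, a, b, c):
    """Convolution kernel K_j = c * a**j * b, truncated to the sequence length."""
    return [c * (a ** j) * b for j in range(length)]


def causal_conv(u, kernel):
    """Causal convolution: y_k = sum over j <= k of kernel[j] * u[k - j]."""
    return [
        sum(kernel[j] * u[k - j] for j in range(k + 1))
        for k in range(len(u))
    ]


u = [1.0, 0.0, 2.0, -1.0]   # illustrative input sequence
a, b, c = 0.5, 1.0, 2.0     # illustrative (assumed) parameters
y_rec = ssm_recurrent(u, a, b, c)
y_conv = causal_conv(u, ssm_kernel(len(u), a, b, c))
print(y_rec, y_conv)
```

Three parameters define a kernel as long as the input itself, which gives some intuition for how such a layer can cover a long context with few parameters.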